In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set(font_scale=2)
sns.set_style("whitegrid")

FIFA Dataset

We will be looking at the FIFA 2018 Dataset. While this is a video game, the developers strive to make their game as accurate as possible, so this data reflects the skills of the real-life players.

Let's load the data frame using pandas.

In [2]:
df = pd.read_csv("FIFA_2018.csv",encoding = "ISO-8859-1",index_col = 0, low_memory = False)

We can take a brief look at the data by calling df.head(). The first 34 columns are attributes describing each player's behavior (e.g. aggression) or skills (e.g. ball control). The final four columns show the player's position, name, nationality, and the club they play for.

The four positions are forward (FWD), midfielder (MID), defender (DEF), and goalkeeper (GK).

In [3]:
df.head()
Out[3]:
Acceleration Aggression Agility Balance Ball control Composure Crossing Curve Dribbling Finishing ... Sprint speed Stamina Standing tackle Strength Vision Volleys Position Name Nationality Club
0 89 63 89 63 93 95 85 81 91 94 ... 91 92 31 80 85 88 FWD Cristiano Ronaldo Portugal Real Madrid CF
1 92 48 90 95 95 96 77 89 97 95 ... 87 73 28 59 90 85 FWD L. Messi Argentina FC Barcelona
2 94 56 96 82 95 92 75 81 96 89 ... 90 78 24 53 80 83 FWD Neymar Brazil Paris Saint-Germain
3 88 78 86 60 91 83 77 86 86 94 ... 77 89 45 80 84 88 FWD L. Suárez Uruguay FC Barcelona
4 58 29 52 35 48 70 15 14 30 13 ... 61 44 10 83 70 11 GK M. Neuer Germany FC Bayern Munich

5 rows × 38 columns

We already know that identifying goalkeepers is quite straightforward, so let's remove the rows corresponding to goalkeepers, along with the goalkeeper-specific attribute columns:

In [4]:
df2 = df[df["Position"] != "GK"].copy()
df2.drop(columns=['GK diving',
                  'GK handling',
                  'GK kicking',
                  'GK positioning',
                  'GK reflexes'], inplace=True)

We can get all the attribute names and store them as labels using .columns.values.
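The cells that build the feature matrix X and the target vector Y are not shown in this excerpt. A minimal sketch of how they could be constructed from the attribute labels follows; the tiny example frame here is a stand-in for df2, not the real dataset (which has 33 attribute columns after the goalkeeper drop):

```python
import pandas as pd

# Stand-in for df2: two attribute columns plus the four descriptive columns.
df2 = pd.DataFrame({
    "Acceleration": [89, 92, 58],
    "Aggression":   [63, 48, 29],
    "Position":     ["FWD", "FWD", "DEF"],
    "Name":         ["A", "B", "C"],
    "Nationality":  ["X", "Y", "Z"],
    "Club":         ["P", "Q", "R"],
})

# Attribute names: every column except the final four descriptive ones.
labels = df2.columns.values[:-4]

X = df2[labels].values      # feature matrix: one row of attributes per player
Y = df2["Position"].values  # target: the position label to predict
```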

In [7]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score,  precision_score, recall_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
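The cell defining dict_classifiers is missing from this excerpt. A plausible reconstruction, mapping display names to the imported estimators, is sketched below; the exact hyperparameters are assumptions (defaults, plus a raised max_iter so LogisticRegression converges), and the Random Forest entry implies an import not shown above:

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# Assumed reconstruction of the missing cell: names as printed in the
# results, each mapped to an estimator with (mostly) default settings.
dict_classifiers = {
    "Nearest Neighbors": KNeighborsClassifier(),
    "LDA": LinearDiscriminantAnalysis(),
    "Gradient Boosting Classifier": GradientBoostingClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Naive Bayes": GaussianNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}
```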
In [8]:
validation_size = 0.3
seed = 7
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=validation_size, random_state=seed)
In [10]:
print('%30s  %16s' % ("Classifier","accuracy") )
for name, clf in list(dict_classifiers.items()):
    
    clf.fit(X_train, Y_train)
    y_result = clf.predict(X_test)
    acc = accuracy_score(Y_test, y_result)
    print('%30s  %16f' % (name, acc) )
    cmat  = confusion_matrix(Y_test, y_result,labels=["DEF","MID","FWD"])
    print(cmat)
                    Classifier          accuracy
             Nearest Neighbors          0.770999
[[1376  240   14]
 [ 342 1608  231]
 [   7  262  706]]
                           LDA          0.790430
[[1395  231    4]
 [ 308 1632  241]
 [   7  212  756]]
  Gradient Boosting Classifier          0.797743
[[1430  194    6]
 [ 326 1685  170]
 [   7  265  703]]
                 Random Forest          0.711241
[[1258  338   34]
 [ 398 1486  297]
 [  20  295  660]]
                   Naive Bayes          0.726285
[[1290  333    7]
 [ 404 1343  434]
 [   5  127  843]]
            Logistic Regression          0.792938
[[1405  220    5]
 [ 305 1668  208]
 [   8  245  722]]
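Accuracy alone hides per-class behavior, and precision_score and recall_score were imported above but never used. Per-class precision and recall can also be read directly off a confusion matrix (rows are true classes, columns are predicted classes); the sketch below does this for the Gradient Boosting matrix from the output above:

```python
import numpy as np

# Gradient Boosting confusion matrix from the run above,
# rows = true class, columns = predicted class, order ["DEF", "MID", "FWD"].
cmat = np.array([[1430,  194,    6],
                 [ 326, 1685,  170],
                 [   7,  265,  703]])

recall = cmat.diagonal() / cmat.sum(axis=1)     # per class: TP / (TP + FN)
precision = cmat.diagonal() / cmat.sum(axis=0)  # per class: TP / (TP + FP)

for cls, p, r in zip(["DEF", "MID", "FWD"], precision, recall):
    print("%4s  precision %.3f  recall %.3f" % (cls, p, r))
```

For example, forwards are recalled least often here (703 of 975 true FWD rows are predicted correctly), mostly through confusion with midfielders, which matches the large MID/FWD off-diagonal entries.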